Abstract


In machine learning, models are derived from labeled training data, where
labels signify classes and features define sample attributes. However, noise
introduced during data collection can impair the algorithm’s performance.
Blanco, Japón, and Puerto proposed mixed-integer programming (MIP) models
within support vector machines (SVM) to handle label noise in training
datasets. However, the number of variables in their models grows noticeably
as the sample size increases. The nonparallel support vector machine (NPSVM)
is a binary classification method that merges the strengths of both SVM and
twin SVM. It determines two nonparallel hyperplanes by solving two
optimization problems. Each hyperplane is positioned to be closer to one of
the classes while maximizing its distance from the other class. In this
paper, to take advantage of NPSVM’s features, NPSVM-based relabeling
(RENPSVM) MIP models are developed to deal with label noise in the dataset.
The proposed models adjust observation labels and seek optimal solutions
while reducing computational cost by selectively focusing on class-relevant
observations within an $\epsilon$-insensitive tube. Instances exhibiting
similarities to the other class are excluded from this $\epsilon$-insensitive
tube. Experiments on 10 UCI datasets show that the proposed NPSVM-based MIP
models outperform their counterparts in accuracy and learning time on the
majority of datasets.


AMS subject classifications (2020): Primary 68T09; Secondary 90C11.


Keywords: Label noise; SVM; Mixed-integer program; Nonparallel SVM. 


1 Introduction 


Support Vector Machine (SVM) [8, 34, 35] is a renowned technique employed
for binary classification in diverse domains, such as abnormality
recognition [22], stock market prediction [1], and pose estimation [36].
Despite its strong performance, SVM incurs substantial computational cost
when solving the Quadratic Programming Problem (QPP) for large datasets. In
response to this challenge, Jayadeva, Khemchandani, and Chandra [19]
introduced the Twin SVM (TWSVM), a method that uses two nonparallel
hyperplanes. Each hyperplane is positioned close to one of the two classes
while keeping at least a unit distance from the samples of the other class.
In contrast to SVM, TWSVM solves two smaller QPPs, thereby reducing the
training time. The TWSVM framework has been extended through various
adaptations, including the Wavelet TWSVM with glowworm swarm optimization by
Ding et al. [11, 12], an enhanced K-nearest neighbor TWSVM by Nasiri and Mir
[26] that addresses noise and outliers, and an automatic TWSVM by
Jimenez-Castano, Alvarez-Meza, and Orozco-Gutierrez [20] for imbalanced
datasets using kernel representation. Although TWSVM offers valuable
properties, it must compute the inverses of specific matrices as part of its
training process. This task becomes impractical or even infeasible for
sizable datasets when conventional methods are used. In contrast, the
standard SVM can efficiently solve large-scale problems through algorithms
such as Sequential Minimal Optimization (SMO). To address this concern, the
Nonparallel SVM (NPSVM) [33] was introduced, which integrates SVM’s benefits
into TWSVM. It supports the SMO algorithm [21, 28] and the concept of
semi-sparseness [33], which together enhance the overall performance of the
model.


The existence of label noise within datasets can have a substantial impact
on the accuracy and generalizability of supervised learning algorithms.
Incorrect labels may arise from diverse sources, such as human error, label
switching, or the intentional introduction of noise [25, 23, 24, 27], and
label noise has been the subject of several research studies. Angluin and
Laird [2] introduced a noise model for learning from samples in noisy
environments. They also suggested computationally feasible learning
algorithms for noisy domains and explored extending these concepts to
broader contexts. A question left open by their model is whether there are
domains in which approximately correct identification is computationally
feasible without noise but becomes computationally infeasible even with
moderate levels of noise. Xiao et al. [38] devised an optimal attack
strategy and used heuristic methods for practical computation. Biggio,
Nelson, and Laskov [4] introduced an algorithmic strategy that effectively
manages adversarial alterations of labels. This technique adjusts the kernel
matrix when labels are independently flipped with equal probabilities.
Another alternative, presented in [15], detects and eliminates inaccurately
labeled instances by selecting samples that are considered dubious and
require additional scrutiny. Obtaining labels with less noise may require
additional time and expense, but it can significantly improve classification
accuracy. To address this challenge, Duan and Wu [13] proposed a learning
approach that leverages both noisy labels and less noisy labels extracted
from a limited portion of the training dataset. Their method estimates
noise-rate parameters and infers precise labels using a noise model built
upon flipping probabilities and a logistic regression classifier. While the
methods presented in [4, 15] effectively tackle label noise in data, they do
so through a two-phase process: label noise is addressed before training,
and the models do not handle label noise simultaneously with the primary
task.


Thulasidasan et al. [31] introduced an approach to mitigate label noise in
Deep Neural Network (DNN) classification. Their method introduces a new loss
function that enables the DNN to abstain from classifying certain samples,
thereby avoiding confusion, while improving classification performance on
the samples that are not abstained from. The method holds promise for
improving the accuracy and robustness of DNN classifiers in practical
applications. In [7], the authors presented a technique for estimating the
level of label noise, showed that importance reweighting can enhance
classification accuracy under label noise, and evaluated the reliability of
two classification approaches: convolutional neural networks and
convolutional neural networks with importance reweighting. Despite their
merits, the models in [31, 7] suffer from an explosion of parameters as the
number of layers increases for tasks such as natural language processing and
computer vision, which demands substantial computational resources.


Blanco, Japón, and Puerto [5] introduced an approach for constructing
optimal classification trees that accounts for noisy labels in the training
data. Their method combines margin-based classifiers with outlier detection
techniques to improve performance. It relies on two main components: (1) the
splitting rules of the tree are designed to maximize class separation
margins, following SVM principles, and (2) during tree construction, some
training sample labels can be adjusted to identify and address label noise.
These elements are integrated to create the final optimal classification
tree. Bertsimas et al. [3] introduced a robust optimization approach for
label noise that adds a new variable representing the probability of
mislabeling for each training point. They also imposed a constraint that
keeps the total number of mislabeled points below a specified threshold,
considering worst-case scenarios. However, their method focuses on
constructing a robust classifier for worst-case scenarios, controls label
noise through a budget hyperparameter, and does not explore all possible
parameter vectors $(w, b)$ for different scenarios.


Recently, the authors of [6] formulated SVM-based mixed-integer programming
(MIP) models to handle classification tasks in the presence of noisy labels.
In contrast to the existing techniques, their method constructs an SVM-based
classifier and adjusts the labels of observations simultaneously to achieve
an optimal solution. A significant advantage of their approach is that it
can derive separating hyperplanes that conventional SVM methods cannot
obtain. However, the number of variables in the Relabel SVM-based (RESVM)
model introduced in [6] increases markedly as the number of samples rises.
To tackle this issue, they further proposed a clustering-based relabeling
(CRESVM) model that combines clustering and classification in the SVM
framework.


In this paper, the relabeling idea is employed within NPSVM to benefit from
its features. Each proposed MIP model considers the instances of the class
represented by the $\epsilon$-insensitive tube. If any of these instances
resemble the other class, those samples of the original class are excluded
from the $\epsilon$-insensitive tube. On most datasets, RENPSVM has fewer
linear constraints and variables than RESVM. Moreover, while the CRESVM
model introduced in [6] has fewer linear constraints and variables than both
RESVM and RENPSVM, its drawback is that it uses nonlinear constraints. In
addition, the structure of NPSVM allows the proposed MIP models to be solved
in parallel, leading to faster learning times on most datasets. The main
contributions of this paper are summarized as follows:


(1) Extending the NPSVM models to MIP models that address label noise by
adjusting the labels of observations and obtaining an optimal solution
simultaneously.


(2) Reducing the computational cost by not treating every observation as a
candidate for a label change. Instead, each model focuses on the instances
of the class that it aims to represent within an $\epsilon$-insensitive
tube.


(3) Parallelizing the proposed MIP models, which results in faster learning
times on the majority of datasets.


(4) Computational experiments on 10 UCI datasets show that RENPSVM
outperforms RESVM and CRESVM in classification accuracy while exhibiting
learning times similar to those of RESVM and CRESVM.


(5) Evaluations on diverse real-world datasets show that the proposed models
are more resilient against attacks than the recent relabeling models
introduced in [6].


The rest of this paper is organized as follows. Section 2 briefly reviews
TWSVM and NPSVM. Section 3 presents the proposed models and compares their
numbers of constraints and variables with those of the RESVM and CRESVM
models in [6]. Section 4 reports computational experiments on 10 UCI
datasets that illustrate the efficiency of the proposed models in comparison
with those in [6], together with two statistical tests that highlight the
differences between the proposed models and those from [6]. Finally, Section
5 presents concluding remarks.


2 Background 


Consider a classification problem with the dataset
$D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)\}$, where
$x_i \in \mathbb{R}^n$ and $y_i \in \{+1, -1\}$ for $i = 1, \ldots, l$ denote
the samples and their labels, respectively. We denote the sets of indices
associated with the positive and negative classes by $I^+$ and $I^-$,
respectively, defined as

$$I^+ = \{i \mid y_i = +1\}, \qquad I^- = \{i \mid y_i = -1\}.$$


In this section, we present a concise overview of the TWSVM and NPSVM
models.


2.1 TWSVM


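Consider $A \in \mathbb{R}^{m_1 \times n}$ and
$B \in \mathbb{R}^{m_2 \times n}$ as the data matrices containing the points
belonging to $I^+$ and $I^-$, respectively. TWSVM is a binary classifier
that establishes two nonparallel hyperplanes by solving two QPPs, each
smaller than the single large QPP in SVM:

$$
\begin{aligned}
\min_{w_1,\, b_1,\, \xi_2} \quad & \tfrac{1}{2}\|A w_1 + e_1 b_1\|^2 + c_1 e_2^\top \xi_2 \\
\text{s.t.} \quad & -(B w_1 + e_2 b_1) + \xi_2 \ge e_2, \\
& \xi_2 \ge 0,
\end{aligned}
\tag{1}
$$

and

$$
\begin{aligned}
\min_{w_2,\, b_2,\, \xi_1} \quad & \tfrac{1}{2}\|B w_2 + e_2 b_2\|^2 + c_2 e_1^\top \xi_1 \\
\text{s.t.} \quad & (A w_2 + e_1 b_2) + \xi_1 \ge e_1, \\
& \xi_1 \ge 0,
\end{aligned}
\tag{2}
$$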

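where $c_1$ and $c_2$ are predetermined trade-off factors weighting the
error variable vectors $\xi_2$ and $\xi_1$, respectively. Also, $e_1$ and
$e_2$ are vectors of ones with appropriate dimensions. The first term in the
objective function of (1) (or (2)) keeps the hyperplane close to the points
of one class ($I^+$), while the constraints require the hyperplane to stay
at least a unit distance from the points of the other class ($I^-$). The
Wolfe dual forms of (1) and (2) are given by

$$
\begin{aligned}
\max_{\alpha} \quad & e_2^\top \alpha - \tfrac{1}{2}\alpha^\top G (H^\top H)^{-1} G^\top \alpha \\
\text{s.t.} \quad & 0 \le \alpha \le c_1 e_2,
\end{aligned}
\tag{3}
$$

and

$$
\begin{aligned}
\max_{\beta} \quad & e_1^\top \beta - \tfrac{1}{2}\beta^\top P (Q^\top Q)^{-1} P^\top \beta \\
\text{s.t.} \quad & 0 \le \beta \le c_2 e_1,
\end{aligned}
\tag{4}
$$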
where $G = [B \;\; e_2]$, $H = [A \;\; e_1]$, $P = [A \;\; e_1]$, and
$Q = [B \;\; e_2]$. As we see, the dual models involve the inverses of
$H^\top H$ and $Q^\top Q = G^\top G$, which are multiplied by the Lagrangian
multipliers $\alpha \in \mathbb{R}^{m_2}$ and $\beta \in \mathbb{R}^{m_1}$.
Finally, the nonparallel hyperplanes are obtained from the solutions
$\alpha$ and $\beta$ of (3) and (4), respectively, through

$$z_1 = -(H^\top H)^{-1} G^\top \alpha, \quad \text{where } z_1 = [w_1^\top \;\; b_1]^\top, \tag{5}$$

and

$$z_2 = (Q^\top Q)^{-1} P^\top \beta, \quad \text{where } z_2 = [w_2^\top \;\; b_2]^\top. \tag{6}$$
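
For completeness, relation (5) can be recovered from the Karush-Kuhn-Tucker
conditions of (1). The following is only a brief sketch: the multiplier
$\gamma \ge 0$ attached to $\xi_2 \ge 0$ is notation assumed here, the
objective and constraint of (1) are rewritten with $H z_1 = A w_1 + e_1 b_1$
and $G z_1 = B w_1 + e_2 b_1$, and $H^\top H$ is assumed to be invertible.

$$
\begin{aligned}
L(z_1, \xi_2, \alpha, \gamma) &= \tfrac{1}{2}\|H z_1\|^2 + c_1 e_2^\top \xi_2
  - \alpha^\top(-G z_1 + \xi_2 - e_2) - \gamma^\top \xi_2, \\
\nabla_{z_1} L &= H^\top H z_1 + G^\top \alpha = 0
  \;\Longrightarrow\; z_1 = -(H^\top H)^{-1} G^\top \alpha, \\
\nabla_{\xi_2} L &= c_1 e_2 - \alpha - \gamma = 0
  \;\Longrightarrow\; 0 \le \alpha \le c_1 e_2 .
\end{aligned}
$$

Substituting this $z_1$ back into $L$ leaves
$e_2^\top \alpha - \tfrac{1}{2}\alpha^\top G (H^\top H)^{-1} G^\top \alpha$,
which is the objective of (3). The same steps applied to (2) lead to (4) and
(6).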


While TWSVM handles smaller QPPs than SVM, it is not without drawbacks [32].
First, in its primal problems TWSVM minimizes only the empirical risk,
neglecting the minimization of structural risk that is present in
conventional SVMs. Second, to handle singularity concerns, TWSVM replaces
the inverse matrices with approximations, leading to solutions that are only
approximate. Third, the need to compute inverse matrices increases the
computational complexity of TWSVM, rendering it impractical for extensive
datasets. Moreover, TWSVM is confined to linear classification and lacks a
straightforward extension to nonlinear scenarios. The demand for swift
solvers, such as the SMO algorithm used for standard SVMs, adds another
layer of complexity. Lastly, TWSVM compromises sparsity by employing a
quadratic loss function, so that the majority of points in a class exert
substantial influence on each decision function, forfeiting the advantages
associated with sparsity.


2.2 The NPSVM 


The NPSVM, which is a generalized version of TWSVM, provides a more 
comprehensive formulation than TWSVM and determines two nonparallel 
hyperplanes using a similar approach. The key difference is that NPSVM 
represents each class within insensitive tubes (Figure 1) and inherits the 
advantages of SVM that TWSVM lacks, such as utilizing the SMO algorithm 
and avoiding the computation of matrix inverses during its model training 
process. The NPSVM solves the following two QPPs: 


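$$
\begin{aligned}
\min_{w_1,\, b_1,\, \eta_1,\, \eta_2,\, \xi_1} \quad & \tfrac{1}{2}\|w_1\|^2 + c_1 e_1^\top(\eta_1 + \eta_2) + c_2 e_2^\top \xi_1 \\
\text{s.t.} \quad & w_1^\top x_i + b_1 \le \epsilon_1 + \eta_{1i}, && i \in I^+, \\
& -(w_1^\top x_i + b_1) \le \epsilon_1 + \eta_{2i}, && i \in I^+, \\
& w_1^\top x_i + b_1 \le -1 + \xi_{1i}, && i \in I^-, \\
& \xi_1,\ \eta_1,\ \eta_2 \ge 0,
\end{aligned}
\tag{7}
$$

and

$$
\begin{aligned}
\min_{w_2,\, b_2,\, \eta_1^*,\, \eta_2^*,\, \xi_2} \quad & \tfrac{1}{2}\|w_2\|^2 + c_3 e_2^\top(\eta_1^* + \eta_2^*) + c_4 e_1^\top \xi_2 \\
\text{s.t.} \quad & w_2^\top x_i + b_2 \le \epsilon_2 + \eta_{1i}^*, && i \in I^-, \\
& -(w_2^\top x_i + b_2) \le \epsilon_2 + \eta_{2i}^*, && i \in I^-, \\
& w_2^\top x_i + b_2 \ge 1 - \xi_{2i}, && i \in I^+, \\
& \xi_2,\ \eta_1^*,\ \eta_2^* \ge 0,
\end{aligned}
\tag{8}
$$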
where $c_i > 0$ ($i = 1, \ldots, 4$) are trade-off factors for the error
variables $\eta_i$, $\eta_i^*$, and $\xi_i$ ($i = 1, 2$). The aim of (7) is
to maximize the margin between the hyperplanes of the $\epsilon$-insensitive
tube, which can be expressed mathematically as $\frac{2\epsilon_1}{\|w_1\|}$.
The first and second sets of constraints ensure that the positive class is
largely concentrated within the $\epsilon$-band between the hyperplanes
$w_1^\top x + b_1 = \epsilon_1$ and $w_1^\top x + b_1 = -\epsilon_1$. The
third set of constraints pushes the negative class away from the hyperplane
$w_1^\top x + b_1 = -1$ as far as possible. A similar description holds for
problem (8).
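
To make the roles of the three constraint sets in (7) concrete, the sketch
below computes, for a fixed candidate pair $(w_1, b_1)$, the smallest slack
values that satisfy each constraint and the resulting objective value. It is
an illustration only and not a solver: the function names, the toy data, and
the parameter values are assumptions introduced here.

```python
from math import fsum


def decision(w, b, x):
    """Value of w^T x + b for a single sample x."""
    return fsum(wi * xi for wi, xi in zip(w, x)) + b


def npsvm_primal_terms(w, b, positives, negatives, eps, c1, c2):
    """Smallest slacks feasible for (7) at a fixed (w, b), plus the objective.

    This only evaluates the primal problem at a given point; it does not
    optimize over (w, b).
    """
    f_pos = [decision(w, b, x) for x in positives]
    f_neg = [decision(w, b, x) for x in negatives]
    eta1 = [max(0.0, f - eps) for f in f_pos]    # w^T x + b <= eps + eta1
    eta2 = [max(0.0, -f - eps) for f in f_pos]   # -(w^T x + b) <= eps + eta2
    xi = [max(0.0, 1.0 + f) for f in f_neg]      # w^T x + b <= -1 + xi
    objective = (0.5 * fsum(wi * wi for wi in w)
                 + c1 * fsum(p + q for p, q in zip(eta1, eta2))
                 + c2 * fsum(xi))
    return eta1, eta2, xi, objective


# Toy data (hypothetical): two positive and two negative samples in R^2.
w, b = [1.0, 0.0], 0.0
positives = [[0.05, 1.0], [0.5, -1.0]]
negatives = [[-2.0, 0.0], [-0.5, 3.0]]
eta1, eta2, xi, obj = npsvm_primal_terms(
    w, b, positives, negatives, eps=0.1, c1=1.0, c2=1.0)
```

In this toy example the first positive sample lies inside the tube and needs
no slack, while the second positive sample and the second negative sample
each contribute a positive slack to the objective.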


Figure 1: Illustration of NPSVM.


3 Relabel NPSVM


In contrast to TWSVM, NPSVM eliminates the need for specific matrix
inversions during model training, which makes it particularly advantageous
for large datasets, where traditional methods may become extremely
challenging or infeasible. Moreover, like the standard SVM, NPSVM remains
efficient on large-scale problems by using the SMO algorithm. In NPSVM, the
introduction of an $\epsilon$-insensitive loss function naturally
incorporates a regularization term. This characteristic distinguishes it
from the original TWSVM and the improved TBSVM, both of which are special
cases of the more general NPSVM: NPSVM reduces to the original TWSVM or to
TBSVM when the corresponding parameters are chosen appropriately.
Additionally, the NPSVM framework promotes the transition from
semi-sparseness to complete sparseness [33]. In this section, we apply the
relabeling approach within NPSVM, aiming to bolster its robustness against
label noise in datasets.
Figure 2: (a) Original dataset. (b) Optimal separating hyperplanes obtained
with (9). Instances from the positive class that remain within the
$\epsilon$-insensitive tube are colored purple, while instances that are
excluded from the respective constraint are colored green.


To apply relabeling to NPSVM, we first aim for the positive class to be
predominantly positioned within the $\epsilon$-insensitive tube while
maximizing its distance from the other class. The gap between the
hyperplanes of the $\epsilon$-insensitive tube is controlled via the loss
function, which improves the alignment between the nonparallel hyperplanes
and the classes they represent. Due to label noise, some instances appear to
belong to a specific class based on their labels but resemble a different
class (the blue points in Figure 2 (a)). To determine whether the samples of
the positive class are included within the $\epsilon$-insensitive tube, a
binary variable vector is incorporated into the model. This vector
determines which instances are removed from the $\epsilon$-insensitive tube
(green in Figure 2 (b)) and which instances remain inside it (purple in
Figure 2 (b)). The binary variable vector enters both the constraints and
the objective function, which prevents extensive relabeling of observations.
The first relabeling NPSVM (RENPSVM) model is formulated as follows:


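$$
\begin{aligned}
\min_{w_1,\, b_1,\, \eta_1,\, \eta_2,\, \xi_1,\, \theta_1} \quad & \tfrac{1}{2}\|w_1\|^2 + c_1 e_1^\top(\eta_1 + \eta_2) + c_2 e_3^\top \xi_1 + c_3 e_1^\top \theta_1 \\
\text{s.t.} \quad & (w_1^\top x_i + b_1)\theta_{1i} \le \epsilon_1 + \eta_{1i}, && i \in I^+, \\
& -(w_1^\top x_i + b_1)\theta_{1i} \le \epsilon_1 + \eta_{2i}, && i \in I^+, \\
& (w_1^\top x_i + b_1)(1 - \theta_{1i}) \le -1 + \xi_{1i} + M_1(1 - \theta_{1i}), && i \in I^+, \\
& w_1^\top x_i + b_1 \le -1 + \xi_{1i}, && i \in I^-, \\
& \theta_{1i} \in \{0, 1\}, && i \in I^+, \\
& w_1 \in \mathbb{R}^n,\ b_1 \in \mathbb{R}, \\
& \eta_1,\ \eta_2,\ \xi_1 \ge 0,
\end{aligned}
\tag{9}
$$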
where $c_i > 0$ ($i = 1, 2, 3$) are trade-off parameters and $M_1$ is a
large positive constant chosen such that its associated constraint becomes
redundant when $\theta_{1i} = 0$. The second term in the objective function,
$\eta_1 + \eta_2$, controls the error for the gap between the hyperplanes
$w_1^\top x + b_1 = \epsilon_1$ and $w_1^\top x + b_1 = -\epsilon_1$. In the
fourth set of constraints, we strive to distance the negative class from the
hyperplane $w_1^\top x + b_1 = -1$. The error vector $\xi_1$ is assessed
using the soft-margin loss function. The binary variable vector $\theta_1$
determines whether the samples of the positive class are included within the
$\epsilon$-insensitive tube. When $\theta_{1i} = 1$, the $i$th instance is
treated as belonging to the positive class; this is expressed by the first
three sets of constraints in (9). The final term in the objective function
prevents extensive reassignment of labels to observations, which might
otherwise produce ineffective classifiers. The second RENPSVM model,
corresponding to the other class, is as follows:


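$$
\begin{aligned}
\min_{w_2,\, b_2,\, \eta_1^*,\, \eta_2^*,\, \xi_2,\, \theta_2} \quad & \tfrac{1}{2}\|w_2\|^2 + c_4 e_2^\top(\eta_1^* + \eta_2^*) + c_5 e_3^\top \xi_2 + c_6 e_2^\top \theta_2 \\
\text{s.t.} \quad & (w_2^\top x_i + b_2)\theta_{2i} \le \epsilon_2 + \eta_{1i}^*, && i \in I^-, \\
& -(w_2^\top x_i + b_2)\theta_{2i} \le \epsilon_2 + \eta_{2i}^*, && i \in I^-, \\
& (w_2^\top x_i + b_2)(1 - \theta_{2i}) \ge 1 - \xi_{2i} - M_2(1 - \theta_{2i}), && i \in I^-, \\
& w_2^\top x_i + b_2 \ge 1 - \xi_{2i}, && i \in I^+, \\
& \theta_{2i} \in \{0, 1\}, && i \in I^-, \\
& w_2 \in \mathbb{R}^n,\ b_2 \in \mathbb{R}, \\
& \eta_1^*,\ \eta_2^*,\ \xi_2 \ge 0.
\end{aligned}
\tag{10}
$$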
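
The text only requires $M_1$ to be large enough that the third constraint of
(9) is redundant when $\theta_{1i} = 0$, in which case it reads
$w_1^\top x_i + b_1 \le -1 + \xi_{1i} + M_1$. A minimal sketch of one way to
obtain such a constant is given below. It assumes that bounds
`w_norm_bound` on $\|w_1\|_2$ and `b_abs_bound` on $|b_1|$ are available;
these bounds, the function name, and the toy data are hypothetical and are
not specified in the paper.

```python
from math import sqrt


def big_m_for_relabel_constraint(points, w_norm_bound, b_abs_bound):
    """Return a constant M with  w^T x_i + b <= -1 + xi_i + M  for all x_i.

    Valid whenever ||w||_2 <= w_norm_bound, |b| <= b_abs_bound and xi_i >= 0,
    because w^T x_i <= ||w||_2 * ||x_i||_2 by the Cauchy-Schwarz inequality.
    """
    max_norm = max(sqrt(sum(v * v for v in x)) for x in points)
    return w_norm_bound * max_norm + b_abs_bound + 1.0


# Toy positive-class samples (hypothetical).
positives = [[3.0, 4.0], [0.0, 1.0]]
M1 = big_m_for_relabel_constraint(positives, w_norm_bound=2.0, b_abs_bound=1.0)
```

A tighter constant generally gives a stronger relaxation for the MIP solver,
so in practice one would prefer the smallest valid value over an arbitrarily
large one.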


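Both models (9) and (10) exhibit nonlinearity in the constraints. To
linearize those constraints, we introduce variables $\alpha_i$,
$\alpha_{0i}$, $\beta_i$, and $\beta_{0i}$ as

$$\beta_i = w_1 \theta_{1i}, \qquad \beta_{0i} = b_1 \theta_{1i}, \tag{11}$$

$$\alpha_i = w_2 \theta_{2i}, \qquad \alpha_{0i} = b_2 \theta_{2i}. \tag{12}$$

Now, by adding the following constraints to (9) and (10),

$$
\begin{aligned}
& w_1 - M_3\theta_{1i} \le \beta_i \le w_1 + M_3\theta_{1i}, && i \in I^+, \\
& -M_3(1 - \theta_{1i}) \le \beta_i \le M_3(1 - \theta_{1i}), && i \in I^+, \\
& b_1 - M_3\theta_{1i} \le \beta_{0i} \le b_1 + M_3\theta_{1i}, && i \in I^+, \\
& -M_3(1 - \theta_{1i}) \le \beta_{0i} \le M_3(1 - \theta_{1i}), && i \in I^+,
\end{aligned}
$$

and

$$
\begin{aligned}
& w_2 - M_4\theta_{2i} \le \alpha_i \le w_2 + M_4\theta_{2i}, && i \in I^-, \\
& -M_4(1 - \theta_{2i}) \le \alpha_i \le M_4(1 - \theta_{2i}), && i \in I^-, \\
& b_2 - M_4\theta_{2i} \le \alpha_{0i} \le b_2 + M_4\theta_{2i}, && i \in I^-, \\
& -M_4(1 - \theta_{2i}) \le \alpha_{0i} \le M_4(1 - \theta_{2i}), && i \in I^-,
\end{aligned}
$$

we obtain the following problems: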
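$$
\begin{aligned}
\min_{w_1,\, b_1,\, \eta_1,\, \eta_2,\, \xi_1,\, \theta_1,\, \beta_i,\, \beta_{0i}} \quad & \tfrac{1}{2}\|w_1\|^2 + c_1 e_1^\top(\eta_1 + \eta_2) + c_2 e_3^\top \xi_1 + c_3 e_1^\top \theta_1 \\
\text{s.t.} \quad & \beta_i^\top x_i + \beta_{0i} \le \epsilon_1 + \eta_{1i}, && i \in I^+, \\
& -(\beta_i^\top x_i + \beta_{0i}) \le \epsilon_1 + \eta_{2i}, && i \in I^+, \\
& w_1^\top x_i + b_1 - (\beta_i^\top x_i + \beta_{0i}) \le -1 + \xi_{1i} + M_1(1 - \theta_{1i}), && i \in I^+, \\
& w_1^\top x_i + b_1 \le -1 + \xi_{1i}, && i \in I^-, \\
& w_1 - M_3\theta_{1i} \le \beta_i \le w_1 + M_3\theta_{1i}, && i \in I^+, \\
& -M_3(1 - \theta_{1i}) \le \beta_i \le M_3(1 - \theta_{1i}), && i \in I^+, \\
& b_1 - M_3\theta_{1i} \le \beta_{0i} \le b_1 + M_3\theta_{1i}, && i \in I^+, \\
& -M_3(1 - \theta_{1i}) \le \beta_{0i} \le M_3(1 - \theta_{1i}), && i \in I^+, \\
& \beta_i \in \mathbb{R}^n,\ \beta_{0i} \in \mathbb{R}, && i \in I^+, \\
& \theta_{1i} \in \{0, 1\}, && i \in I^+, \\
& w_1 \in \mathbb{R}^n,\ b_1 \in \mathbb{R}, \\
& \eta_1,\ \eta_2,\ \xi_1 \ge 0,
\end{aligned}
$$

and

$$
\begin{aligned}
\min_{w_2,\, b_2,\, \eta_1^*,\, \eta_2^*,\, \xi_2,\, \theta_2,\, \alpha_i,\, \alpha_{0i}} \quad & \tfrac{1}{2}\|w_2\|^2 + c_4 e_2^\top(\eta_1^* + \eta_2^*) + c_5 e_3^\top \xi_2 + c_6 e_2^\top \theta_2 \\
\text{s.t.} \quad & \alpha_i^\top x_i + \alpha_{0i} \le \epsilon_2 + \eta_{1i}^*, && i \in I^-, \\
& -(\alpha_i^\top x_i + \alpha_{0i}) \le \epsilon_2 + \eta_{2i}^*, && i \in I^-, \\
& w_2^\top x_i + b_2 - (\alpha_i^\top x_i + \alpha_{0i}) \ge 1 - \xi_{2i} - M_2(1 - \theta_{2i}), && i \in I^-, \\
& w_2^\top x_i + b_2 \ge 1 - \xi_{2i}, && i \in I^+, \\
& w_2 - M_4\theta_{2i} \le \alpha_i \le w_2 + M_4\theta_{2i}, && i \in I^-, \\
& -M_4(1 - \theta_{2i}) \le \alpha_i \le M_4(1 - \theta_{2i}), && i \in I^-, \\
& b_2 - M_4\theta_{2i} \le \alpha_{0i} \le b_2 + M_4\theta_{2i}, && i \in I^-, \\
& -M_4(1 - \theta_{2i}) \le \alpha_{0i} \le M_4(1 - \theta_{2i}), && i \in I^-, \\
& \alpha_i \in \mathbb{R}^n,\ \alpha_{0i} \in \mathbb{R}, && i \in I^-, \\
& \theta_{2i} \in \{0, 1\}, && i \in I^-, \\
& w_2 \in \mathbb{R}^n,\ b_2 \in \mathbb{R}, \\
& \eta_1^*,\ \eta_2^*,\ \xi_2 \ge 0,
\end{aligned}
$$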
where $M_i$, with $i = 1, \ldots, 4$, are sufficiently large positive
constants. MIP models are NP-hard in general, and solving large-scale MIP
problems can be computationally challenging, often requiring sophisticated
optimization algorithms and heuristics to find near-optimal solutions within
reasonable time frames [37]. To compare the proposed MIP models with those
in [6], the numbers of variables and constraints of the models are provided
in Table 1. According to this table, the number of variables of each
RENPSVM model is lower than that of RESVM when
$\frac{m_i}{l} < \frac{n+2}{n+4}$ (for $i = 1, 2$), which is the case for
datasets where the number of features exceeds seven. Also, each RENPSVM MIP
model has fewer linear constraints than RESVM when
$\frac{m_i}{l} < \frac{n+1}{n+2}$ (for $i = 1, 2$), which is the case for
datasets where $n > 7$. It should also be noted that although CRESVM has
fewer linear constraints than both RESVM and RENPSVM, its drawback is that
it has nonlinear constraints.


Table 1: Number of variables and constraints of RESVM, CRESVM, and RENPSVM


                       RESVM              CRESVM         RENPSVM (13)                  RENPSVM (13)
Variables              $ln + 3l + n + 1$  $4l + 3n + 1$  $m_1 n + 4m_1 + l + n + 1$    $m_2 n + 4m_2 + l + n + 1$
Linear constraints     $6l + 4ln$         $5l$           $4m_1 n + 9m_1 + m_2 + l$     $4m_2 n + 9m_2 + m_1 + l$
Nonlinear constraints  $0$                $2l$           $0$                           $0$
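
The counts in Table 1 can be evaluated for a concrete dataset to see how the
model sizes compare. The sketch below is illustrative: the function name and
the dictionary keys `RENPSVM_pos` and `RENPSVM_neg` (for the two RENPSVM
models built on $I^+$ and $I^-$) are labels chosen here, and the example
uses the Cancer row of Table 2 ($l = 699$, $n = 9$, $m_1 = 458$,
$m_2 = 241$).

```python
def model_sizes(l, n, m1, m2):
    """Variable and constraint counts of Table 1.

    l: number of samples, n: number of features,
    m1 / m2: number of positive / negative samples (m1 + m2 == l).
    """
    return {
        "RESVM": {"variables": l * n + 3 * l + n + 1,
                  "linear": 6 * l + 4 * l * n,
                  "nonlinear": 0},
        "CRESVM": {"variables": 4 * l + 3 * n + 1,
                   "linear": 5 * l,
                   "nonlinear": 2 * l},
        "RENPSVM_pos": {"variables": m1 * n + 4 * m1 + l + n + 1,
                        "linear": 4 * m1 * n + 9 * m1 + m2 + l,
                        "nonlinear": 0},
        "RENPSVM_neg": {"variables": m2 * n + 4 * m2 + l + n + 1,
                        "linear": 4 * m2 * n + 9 * m2 + m1 + l,
                        "nonlinear": 0},
    }


# Cancer dataset from Table 2.
sizes = model_sizes(l=699, n=9, m1=458, m2=241)
for name, counts in sizes.items():
    print(name, counts)
```

For this dataset both RENPSVM models are smaller than RESVM in variables and
linear constraints, while CRESVM is smaller still at the price of $2l$
nonlinear constraints.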


4 Computational experiments 


To demonstrate the effectiveness of RENPSVM, we conducted experiments on a
set of 10 UCI datasets, as detailed in Table 2. For the Vertebral dataset,
two scenarios are considered: the first (Vertebral1) distinguishes patients
as either Normal (100) or Disk Hernia (60), and the second (Vertebral2)
categorizes patients as either Normal (100) or Abnormal, with Abnormal
encompassing individuals with Disk Hernia (60) or Spondylolisthesis (150).
To evaluate the models’ resilience to label noise, we executed three
distinct experiments for each dataset: one on the original dataset and two
in which random label flips are introduced in the training data at rates of
20% and 50%. All models are implemented in MATLAB 2020 (64-bit) on a
computer equipped with an Intel Core i5 processor and 4 GB of RAM. The
RESVM, CRESVM, and RENPSVM models are solved using CVX-Mosek [16]. For all
models, the hyperparameters $c_i$ ($i = 1, \ldots, 4$) are chosen from the
set $\{2^j \mid j = -8, \ldots, 8\}$, taking into consideration their impact
on the models’ performance. To mitigate issues such as overfitting and bias
across all datasets, we employed 10-fold cross-validation. This technique
partitions the dataset into ten equally sized subsets, as recommended by
[14]. The models are trained on nine of these subsets, while the remaining
subset is used to compute the prediction error of the models. This process
is repeated for each of the ten subsets. Finally, the average classification
accuracy is computed using the following formula:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

where $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true
negatives, false positives, and false negatives, respectively. Computational
results are summarized in Table 3.
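
The ingredients of this protocol (random label flips, the 10-fold partition,
the hyperparameter grid, and the accuracy formula) can be sketched as
follows. This is an illustrative Python sketch rather than the MATLAB code
used in the experiments: the function names, the random seed, and the toy
label vector are assumptions made here, and details such as how flips
interact with the folds are not specified in the text.

```python
import random


def flip_labels(labels, rate, rng):
    """Flip the sign of a randomly chosen fraction `rate` of +1/-1 labels."""
    flipped = list(labels)
    n_flip = int(round(rate * len(labels)))
    for i in rng.sample(range(len(labels)), n_flip):
        flipped[i] = -flipped[i]
    return flipped


def k_fold_indices(n_samples, k, rng):
    """Shuffle the sample indices and split them into k near-equal folds."""
    idx = list(range(n_samples))
    rng.shuffle(idx)
    return [idx[f::k] for f in range(k)]


def accuracy(y_true, y_pred):
    """(TP + TN) / (TP + TN + FP + FN) for +1/-1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    return (tp + tn) / (tp + tn + fp + fn)


# Candidate values {2^j | j = -8, ..., 8} for each trade-off parameter.
C_GRID = [2.0 ** j for j in range(-8, 9)]

rng = random.Random(0)                     # fixed seed, chosen arbitrarily
labels = [1] * 60 + [-1] * 40              # toy label vector
noisy = flip_labels(labels, 0.2, rng)      # 20% of the labels flipped
folds = k_fold_indices(len(labels), 10, rng)
```

Each fold in `folds` would serve once as the test set while the remaining
nine folds form the training set, and the per-fold accuracies are averaged.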


Table 2: Characteristics of datasets


Dataset      Samples  Positive  Negative  Features  Classes
Car          1594     1210      384       7         2
Haberman     306      225       81        3         2
Cancer       699      458       241       9         2
Vertebral    310      60        100       7         3
Hayes-Roth   102      51        51        5         2
Diabetes     768      500       268       8         2
Ionosphere   351      225       126       34        2
Votes        435      267       168       16        2
Heart        270      260       120       13        2
Table 3: Performance comparison of all models, reported as Accuracy(Time)


                         Flip percentage
Dataset     Method       0.0               0.2              0.5
Car         RESVM        100(53.40)        79.98(51.38)     51.38(56.04)
            CRESVM       100(68.61)        78.41(82.67)     52.01(86.69)
            RENPSVM      100(161.89)       79.99(99.37)     51.25(73.93)
Haberman    RESVM        71.29(449.52)     63.36(405.02)    48.93(116.95)
            CRESVM       75.438(13.79)     59.57(290.36)    49.18(188.52)
            RENPSVM      72.87(104.42)     66.025(30.28)    55.89(28.32)
Cancer      RESVM        94.85(40.15)      58.50(125.75)    48.93(116.95)
            CRESVM       96.28(311.55)     69.07(311.35)    48.074(309.69)
            RENPSVM      97(45.92)         77.97(42.02)     53.06(36.01)
Vertebral1  RESVM        100(23.055)       77.07(108.57)    55.01(108.75)
            CRESVM       100(114.80)       76.87(9.23)      55.01(21.57)
            RENPSVM      100(16.12)        80(16.21)        56.26(30.89)
Vertebral2  RESVM        81.29(347.71)     68.81(366.69)    55.005(108.76)
            CRESVM       76.7742(311.88)   64.39(25.41)     51.62(96.93)
            RENPSVM      79.67(18.47)      69.032(18.06)    55.447(18.82)
Hayes-Roth  RESVM        51.09(111.36)     50.12(107.37)    52.90(108.51)
            CRESVM       53.51(308.04)     53.92(51.91)     49.22(47.81)
            RENPSVM      62.54(32.57)      62.30(14.33)     54.63(13.99)
Diabetes    RESVM        65.10(115.84)     55.73(114.52)    51.53(115.80)
            CRESVM       69.52(309.61)     55.75(281.17)    50.51(305.35)
            RENPSVM      75.65(43.79)      56.51(34.75)     51.69(36.04)
Ionosphere  RESVM        84.88(316.98)     56.13(318.21)    51.85(316.65)
            CRESVM       82.9(312.50)      69.22(40.40)     45.48(27.55)
            RENPSVM      85.18(68.81)      68.39(65.44)     53.85(65.21)
Votes       RESVM        95.88(310.98)     76.99(310.41)    50.33(319.23)
            CRESVM       94.49(299.53)     74.41(308.84)    44.17(311.22)
            RENPSVM      95.62(35.25)      78.88(32.73)     54.28(27.02)
Heart       RESVM        78.89(09.23)      54.81(309.49)    53.387(359.49)
            CRESVM       83.70(28.73)      74.41(308.84)    52.96(72.06)
            RENPSVM      85.18(21.81)      72.59(326.27)    55.93(319.05)


According to Table 3, for the original datasets, RENPSVM attains higher
accuracy than the other models on all datasets except Haberman, Votes, and
Vertebral2, and it matches the accuracy of the other models on Car and
Vertebral1. In terms of learning time, RENPSVM outperforms CRESVM and RESVM
on most datasets, with Car as an exception, and it ranks second on the
Haberman and Cancer datasets. With a label flip rate of 20%, the accuracy of
RENPSVM surpasses that of the other models on all datasets except Heart and
Ionosphere. In terms of learning time, RENPSVM outperforms the other models
except on Car, Vertebral1, Ionosphere, and Heart; among these, it ranks
second on Vertebral1 and Ionosphere. With a label flip rate of 50%, Car is
the only dataset on which the proposed model does not achieve the highest
accuracy. As for learning time, the proposed model is faster than all other
models except on Car, Vertebral1, Ionosphere, and Heart, where it ranks
second. Overall, the results in Table 3 indicate that as the percentage of
flipped labels increases, the proposed model achieves higher accuracy than
the referenced models and demonstrates enhanced robustness.

Next, the modified Friedman test is conducted to assess whether differences
exist among the three models in Table 3. Following this, the Nemenyi
post-hoc test is used for pairwise comparisons to determine which of the
considered methods differ significantly.

The Friedman test is a nonparametric statistical test that does not rely on
assumptions about the underlying data distribution [9]. For each dataset,
ranks are assigned to all algorithms, with the top-performing algorithm
receiving rank 1, the second-best algorithm receiving rank 2, and so forth.
In cases of ties, average ranks are used. Let $r_{ij}$ denote the rank of
the $j$th algorithm on the $i$th dataset. The test examines the average rank
of each algorithm, denoted as $r_j = \frac{1}{N}\sum_{i=1}^{N} r_{ij}$. To
account for the potential conservatism of the Friedman test, a modified
version of it is calculated as outlined in [18]:


$$F_F = \frac{(N-1)\chi_F^2}{N(k-1) - \chi_F^2}, \tag{13}$$

where $\chi_F^2$ is equal to
$\frac{12N}{k(k+1)}\left(\sum_{j=1}^{k} r_j^2 - \frac{k(k+1)^2}{4}\right)$,
$N$ represents the number of datasets, and $k$ denotes the number of
methods. Furthermore, $F_F$ is distributed according to the $F$-distribution
with $(k-1, (k-1)(N-1))$ degrees of freedom. The average accuracy ranks
corresponding to Table 3 are presented in Table 4. The critical value of
$F(2, 18)$ at a significance level of $\alpha = 0.1$ is determined to be
3.63. Considering the average ranks (Table 4), the $\chi_F^2$ values for the
scenarios with label flip percentages of 0%, 20%, and 50% are 3.8, 9.8, and
10.05, respectively. The corresponding $F_F$ values are 2.1111, 8.6471, and
9.0905. Given that the $F_F$ values for the 20% and 50% scenarios exceed the
critical value $F(2, 18) = 3.63$, and that the average rank of RENPSVM is
lower than those of RESVM and CRESVM, it can be inferred that there is a
significant difference between RENPSVM and the models introduced in [6] in
these scenarios.
Table 4: Average accuracy rank of all models

                             Flip percentage
               Method        0.0     0.2     0.5
Average rank   RESVM         2.3     2.5     2.25
               CRESVM        2.2     2.3     2.55
               RENPSVM       1.5     1.2     1.2
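As a check on the arithmetic, the modified Friedman statistic can be recomputed
from the average ranks in Table 4. The following is a minimal sketch, assuming
N = 10 datasets (the 10 UCI datasets used in the experiments) and k = 3
methods; the function name `friedman_statistics` is hypothetical.

```python
def friedman_statistics(avg_ranks, n_datasets):
    """Return (chi2_F, F_F) computed from the average ranks of k methods."""
    k = len(avg_ranks)
    n = n_datasets
    # Friedman statistic: 12N / (k(k+1)) * (sum_j R_j^2 - k(k+1)^2 / 4)
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4
    )
    # Modified statistic: (N-1) chi2 / (N(k-1) - chi2)
    f_stat = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return chi2, f_stat


# Average ranks from Table 4, ordered as RESVM, CRESVM, RENPSVM.
table4 = {
    "0%": [2.3, 2.2, 1.5],
    "20%": [2.5, 2.3, 1.2],
    "50%": [2.25, 2.55, 1.2],
}

for flip, ranks in table4.items():
    chi2, f_stat = friedman_statistics(ranks, n_datasets=10)
    print(f"{flip}: chi2_F = {chi2:.2f}, F_F = {f_stat:.4f}")
```

Under these assumptions the sketch reproduces the values reported in the text
(3.8, 9.8, 10.05 for $\chi_F^2$ and 2.1111, 8.6471, 9.0905 for $F_F$).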


The Nemenyi post-hoc test is used to compare multiple methods pairwise and to
ascertain whether significant differences exist between them. To execute this
test, we compute a parameter known as the critical difference (CD). The CD is
determined by the number of datasets, the number of methods, and the chosen
significance level, and it is then compared with the differences between the
average ranks of the models in Table 4. When the difference in average ranks
between two methods exceeds the CD value, it can be inferred that a
statistically significant difference exists between those two methods. The CD
value is calculated as follows:

$$\mathrm{CD} = q_{\alpha=0.1}\sqrt{\frac{k(k+1)}{6N}},$$

where the parameter $q$ represents the critical value, $k$ signifies the
number of models, and $N$ denotes the number of datasets. For a significance
level of 0.1 and considering three methods, the critical value extracted from
the Nemenyi distribution table amounts to $q = 2.3122$. Substituting these
values into the above equation yields a computed CD value of 1.0340. The
difference between the average ranks of two models is represented as $\Xi$.
In the scenario where the label flip percentage is 0%, we encounter the
following conditions:


$$\Xi(\text{RENPSVM} - \text{RESVM}) = |1.5 - 2.3| = 0.8 < \mathrm{CD}\ (1.0340),$$
$$\Xi(\text{RENPSVM} - \text{CRESVM}) = |1.5 - 2.2| = 0.7 < \mathrm{CD}\ (1.0340).$$
When the flip percentage of labels is 20%, we have:

$$\Xi(\text{RENPSVM} - \text{RESVM}) = |1.2 - 2.5| = 1.3 > \mathrm{CD}\ (1.0340),$$
$$\Xi(\text{RENPSVM} - \text{CRESVM}) = |1.2 - 2.3| = 1.1 > \mathrm{CD}\ (1.0340).$$


Finally, when the flip percentage of labels is 50%, we have:

$$\Xi(\text{RENPSVM} - \text{RESVM}) = |1.2 - 2.25| = 1.05 > \mathrm{CD}\ (1.0340),$$
$$\Xi(\text{RENPSVM} - \text{CRESVM}) = |1.2 - 2.55| = 1.35 > \mathrm{CD}\ (1.0340).$$


Based on the preceding results, a significant difference exists between
RENPSVM and the other models in the 20% and 50% label-flip scenarios, but not
on the original datasets (0% flipped labels).
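The pairwise comparisons above can be reproduced with a short Python sketch.
It assumes k = 3 methods and N = 10 datasets, which together with the quoted
critical value q = 2.3122 give the reported CD of 1.0340. The function names
`critical_difference` and `differs_from` are hypothetical.

```python
import math


def critical_difference(q_alpha, k, n_datasets):
    """Nemenyi critical difference: CD = q_alpha * sqrt(k (k + 1) / (6 N))."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_datasets))


def differs_from(reference, avg_ranks, cd):
    """For each other method, report whether its rank gap to `reference` exceeds CD."""
    ref_rank = avg_ranks[reference]
    return {
        name: abs(ref_rank - rank) > cd
        for name, rank in avg_ranks.items()
        if name != reference
    }


# q = 2.3122 is the critical value quoted in the text; k = 3 and N = 10 are assumed.
cd = critical_difference(q_alpha=2.3122, k=3, n_datasets=10)

# Average ranks from Table 4, keyed by label-flip percentage.
table4 = {
    "0%": {"RESVM": 2.3, "CRESVM": 2.2, "RENPSVM": 1.5},
    "20%": {"RESVM": 2.5, "CRESVM": 2.3, "RENPSVM": 1.2},
    "50%": {"RESVM": 2.25, "CRESVM": 2.55, "RENPSVM": 1.2},
}

for flip, ranks in table4.items():
    print(flip, f"CD = {cd:.4f}", differs_from("RENPSVM", ranks, cd))
```

A value of `True` marks a pair whose rank difference exceeds the CD. With the
ranks of Table 4 this occurs for both comparisons at 20% and 50% flipped
labels and for neither comparison at 0%.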


5 Conclusions 


In this paper, we have introduced MIP models based on NPSVM for relabeling
noisy data. Our approach refines observation labels while simultaneously
achieving an optimal solution. We achieve significant reductions in
computational cost by not treating every observation as a candidate for label
adjustment. Instead, we concentrate on the instances of the class that the
model aims to represent, within an $\epsilon$-intensive tube. The inherent
structure of NPSVM allows the proposed MIP models to be executed in parallel,
resulting in shorter learning times on the majority of datasets. Our findings
indicate that, for datasets with more than seven features, each RENPSVM MIP
model has fewer linear constraints and variables than RESVM, subject to
specific conditions; this holds for the majority of datasets. The CRESVM model
has fewer linear constraints and variables than both RESVM and RENPSVM, at the
cost of introducing nonlinear constraints.
The effectiveness of our proposed models is evaluated through experiments
conducted on 10 UCI datasets. The outcomes show that the RENPSVM models
outperform RESVM and CRESVM in classification accuracy and learning time on
most datasets. As the percentage of flipped labels increases, the proposed
RENPSVM model demonstrates superior accuracy compared to the referenced models
and enhanced robustness. Moreover, we employed the modified Friedman test and
the Nemenyi post-hoc test to assess the influence of label noise on our
model's performance relative to other methods. The tests revealed a
significant difference between RENPSVM and the other models, except on the
original datasets. For future work, one may consider extending the proposed
model to multi-class classification through a one-vs-one-vs-rest approach
[29], or adapting it into a regression model. The latter adaptation can be
particularly useful for handling label noise in target values, a common
challenge in regression tasks [30]. Also, when dealing with datasets
containing a large number of features, the computational cost can become
prohibitively high. In such cases, it is more efficient to derive hyperplane
classifiers using the dual problem formulation [10]. Therefore, studying label
noise using dual models might be another interesting direction for future
research.